This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.

You can find the complete handbook on Github

Cleaning data

Overview

This page demonstrates common steps necessary to clean a dataset. It uses a simulated Ebola case linelist, which is used throughout the handbook.

HOW TO READ: To emphasize the tidyverse coding approach, each cleaning step is explained individually and then incorporated into a “cleaning pipeline” - a series of cleaning actions linked together sequentially through pipes (LINK TO PIPES). The pipe begins with the “raw” data (linelist_raw) and ends with a “clean” dataset (linelist).

The cleaning steps demonstrated include:

  • Loading the data
  • column name cleaning
  • column selection
  • Designating column classes
  • Filtering rows
  • Re-coding values
  • Creating groups (case_when())
  • Dealing with character case (upper, lower, title, etc.)
  • Factor columns

replace missing with dealing with cases (all lower, etc) case_when() factors

Preparation

Load packages

pacman::p_load(tidyverse,  # data manipulation and visualization
               janitor,    # data cleaning
               epitrix     # data cleaning
               )

Load data

Import the raw dataset using the import() function from the package rio. (LINK HERE TO IMPORT PAGE)

linelist_raw <- import("ebola_simulated.xlsx")

You can view the first 50 rows of the the original “raw” dataset below:

# display the linelist data as a table
DT::datatable(head(linelist_raw,50), rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

Cleaning pipeline

In epidemiological analysis and data processing, cleaning steps are often performed together and sequentially. In R this often manifests as a cleaning “pipeline”, where the raw dataset is passed or “piped” from one cleaning step to another. The chain often utilizes dplyr verbs and the magrittr pipe operator (see handbook page on dplyr and tidyverse coding style (LINK HERE).

In a cleaning pipeline the order of the steps is important. Cleaning steps may include:

  • Column names may be cleaned or changed
  • Rows may be filtered or added
  • Columns may be selected, added, transformed, or re-ordered
  • Values may be re-coded, cleaned, or grouped

Column names

Column names are used very often so they need to have “clean” syntax. We suggest the following:

  • Short names
  • No spaces (replaced with underscores (_),
  • No unusual characters (&, #…)
  • Similar style nomenclature (e.g. all date columns named like date_onset, date_report, date_death…)

The columns names of linelist_raw are below. We can see that there are some with spaces. We also have different naming patterns for dates (‘date onset’ and ‘infection date’).

Also note that in the raw data, the two final columns names were two merged cells with one name. The import() function used the name for the first of the two columns, and assigned the second column the name “…23” as it was then empty (referring to the 23rd column).

names(linelist_raw)
##  [1] "row_num"         "case_id"         "generation"      "infection date" 
##  [5] "date onset"      "hosp date"       "date_of_outcome" "outcome"        
##  [9] "gender"          "hospital"        "lon"             "lat"            
## [13] "infector"        "source"          "age"             "age_unit"       
## [17] "fever"           "chills"          "cough"           "aches"          
## [21] "vomit"           "merged_column"   "...23"
Note: For a column name that include spaces, surround the name with back-ticks, for example: linelist$`infection date`. On a keyboard, the back-tick (`) is different from the single quotation mark ('), and is sometimes on the same key as the tilde (~).

Automatic syntax cleaning

The function clean_names() from the package janitor standardizes column names and makes them unique by doing the following:

  • Converts all names to consist of only underscores, numbers, and letters
  • Accented characters are transliterated to ASCII (e.g. german o with umlaut becomes “o”, spanish “enye” becomes “n”)
  • Capitalization preference can be specified using the case = argument (“snake” is default, alternatives include “sentence”, “title”, “small_camel”…)
  • You can designate specific name replacements with the replace = argument (e.g. replace = c(onset = “date_of_onset”))
  • Here is an online vignette

Below, the cleaning pipeline begins by using clean_names() on the raw linelist.

# send the dataset through the function clean_names()
linelist <- linelist_raw %>% 
  janitor::clean_names()

# see the new names
names(linelist)
##  [1] "row_num"         "case_id"         "generation"      "infection_date" 
##  [5] "date_onset"      "hosp_date"       "date_of_outcome" "outcome"        
##  [9] "gender"          "hospital"        "lon"             "lat"            
## [13] "infector"        "source"          "age"             "age_unit"       
## [17] "fever"           "chills"          "cough"           "aches"          
## [21] "vomit"           "merged_column"   "x23"

NOTE: The column name “…23” was changed to “x23”.

Manual column name cleaning

Re-naming columns manually is often necessary. Below, re-naming is performed using the rename() function from the dplyr package, as part of a pipe chain. rename() uses the style “NEW = OLD”, the new column name is given before the old column name.

Below, a re-name command is added to the cleaning pipeline:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome)

Now you can see that the columns names have been changed:

# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

You can also rename by column position, instead of column name, for example:

rename(newNameForFirstColumn = 1,
       newNameForSecondColumn = 2)

Empty Excel column names

If you importing an Excel sheet with a missing column name, depending on the import function used, R will likely create a column name with a value like “…1” or “…2”. You can clean these names manually by referencing their position number (see above), or their name (linelist_raw$...1).

Merged Excel column names

Merged cells in an Excel file are a common occurrance when receiving data from field level. Merged cells can be nice for human reading of data, but cause many problems for machine reading of data. R cannot accomodate merged cells. If at all possible, try to change procedures so that data arrive in a tidy format without merged cells.

When using rio’s import() function, the value in a merged cell will be assigned to the first cell and subsequent cells will be empty.

One solution to deal with merged cells is to import the data with the function readWorkbook() from package openxlsx. Set the argument fillMergedCells = TRUE. This gives the value in a merged cell to all cells within the merge range.

linelist_raw <- openxlsx::readWorkbook(here("data", "ebola_simulated.xlsx"), fillMergedCells = TRUE)

DANGER: If column names are merged, you will end up with duplicate column names, which you will need to fix manually - R does not work well with duplicate column names! You can re-name them by referencing their position (e.g. column 5), as explained in the section on manual column name cleaning..

Skip import of column names

Sometimes, you may want to avoid importing a row of data. Using import() from the rio package on a .xlsx file, you can do this with the argument skip =. Provide the number of rows you want to skip.

linelist_raw <- import("ebola_simulated.xlsx", skip = 1)  # does not import header row

Second header row

You may need to avoid importing the second row of data, for example if it is a data dictionary row (as in the example linelist). This can be problematic because it can result in all columns being imported as class “character”. To solve this, you will likely need to import the data twice.

  1. Import the data in order to store the correct column names
  2. Import the data again, skipping the first two rows (header and second rows)
  3. Bind the correct names onto the reduced dataframe

The exact arguments used to bind the correct column names depends on the type of data file (.csv, .tsv, .xlsx, etc.). If using rio’s import() function, understand which function rio uses to import your data, and then give the appropriate argument to skip lines and/or designate the column names. See the handbook page on importing data (LINK) for details on rio.

For Excel files:

# For excel files
linelist_raw_names <- import("ebola_simulated.xlsx") %>% names()
linelist_raw <- import("ebola_simulated.xlsx", skip = 2, col_names = linelist_raw_names) # argument is 'col_names'

For CSV files:

# For csv files
linelist_raw_names <- import("ebola_simulated.csv") %>% names()
linelist_raw <- import("ebola_simulated.csv", skip = 2, col.names = linelist_raw_names) # argument is 'col.names'

Backup option - changing column names as a separate command

# assign/overwrite headers using the base 'colnames()' function
colnames(linelist_raw) <- linelist_raw_names

Bonus! If you do have a second row that is a data dictionary, you can easily create a proper data dictionary from it using the gather() command from the tidyr package.
source: https://alison.rbind.io/post/2018-02-23-read-multiple-header-rows/

TO DO

library(tidyr)
stickers_dict <- import("ebola_simulated.xlsx") %>% 
  clean_names() %>% 
  gather(variable_name, variable_description)
stickers_dict

Combine two header rows

In some cases, you may want to combine two header rows into one. This command will define the column names as the combination (pasting together) of the existing column names with the value underneath in the first row.

names(df) <- paste(names(df), df[1, ], sep = "_")

Select or re-order columns

CAUTION: This tab may follow from previous tabs.

Often the first step of cleaning data is selecting the columns you want to work with, and to set their order in the dataframe. In a dplyr chain of verbs, this is done with select(). Note that in these examples we modify linelist with select(), but do not assign/overwrite. We just display the resulting new column names, for purpose of example.

CAUTION: In the examples below, linelist is modified with select() but not over-written. New column names are only displayed for purpose of example.

Here are all the column names in the linelist:

names(linelist)
##  [1] "row_num"              "case_id"              "generation"          
##  [4] "date_infection"       "date_onset"           "date_hospitalisation"
##  [7] "date_outcome"         "outcome"              "gender"              
## [10] "hospital"             "lon"                  "lat"                 
## [13] "infector"             "source"               "age"                 
## [16] "age_unit"             "fever"                "chills"              
## [19] "cough"                "aches"                "vomit"               
## [22] "merged_column"        "x23"

With select() you can do the following:

Select only the columns you want to remain, and their order of appearance

# linelist dataset is piped through select() command, and names() prints just the column names
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, fever) %>% 
  names() # display the column names
## [1] "case_id"              "date_onset"           "date_hospitalisation"
## [4] "fever"

Indicate which columns to remove by placing a minus symbol “-” in front of the column name (e.g. select(-outcome)), or a vector of column names (as below). All other columns will be retained. Inside select() you can use normal operators such as c() to list several columns, : for consecutive columns, ! for opposite, & for AND, and | for OR.

linelist %>% 
  select(-c(date_onset, fever:vomit)) %>% # remove onset and all symptom columns
  names()
##  [1] "row_num"              "case_id"              "generation"          
##  [4] "date_infection"       "date_hospitalisation" "date_outcome"        
##  [7] "outcome"              "gender"               "hospital"            
## [10] "lon"                  "lat"                  "infector"            
## [13] "source"               "age"                  "age_unit"            
## [16] "merged_column"        "x23"

Re-order the columns - use everything() to signify all other columns not specified in the select() command:

# move case_id, date_onset, date_hospitalisation, and gender to beginning
linelist %>% 
  select(case_id, date_onset, date_hospitalisation, gender, everything()) %>% 
  names()
##  [1] "case_id"              "date_onset"           "date_hospitalisation"
##  [4] "gender"               "row_num"              "generation"          
##  [7] "date_infection"       "date_outcome"         "outcome"             
## [10] "hospital"             "lon"                  "lat"                 
## [13] "infector"             "source"               "age"                 
## [16] "age_unit"             "fever"                "chills"              
## [19] "cough"                "aches"                "vomit"               
## [22] "merged_column"        "x23"

As well as everything() there are several special functions that work within select(), namely:

  • everything() - all other columns not mentioned
  • last_col() - the last column
  • where() - applies a function to all columns and selects those which are TRUE
  • starts_with() - matches to a specified prefix. Example: select(starts_with("date"))
  • ends_with() - matches to a specified suffix. Example: select(ends_with("_end"))
  • contains() - columns containing a character string. Example: select(contains("time"))
  • matches() - to apply a regular expression (regex). Example: select(contains("[pt]al"))
  • num_range() -
  • any_of() - matches if column is named. Useful if the name might not exist. Example: select(any_of(date_onset, date_death, cardiac_arrest))

Here is an example using where():

# select columns containing certain characters
linelist %>% 
  select(contains("date")) %>% 
  names()
## [1] "date_infection"       "date_onset"           "date_hospitalisation"
## [4] "date_outcome"
# searched for multiple character matches
linelist %>% 
  select(matches("onset|hosp|fev")) %>%   # note the OR symbol "|"
  names()
## [1] "date_onset"           "date_hospitalisation" "hospital"            
## [4] "fever"

Adding select()to the cleaning pipe chain:

In the linelist, there are a few columns we do not need: row_num, merged_column, and x23. Remove them by adding a select() command to the cleaning pipe chain:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
    
    # remove column
    select(-c(row_num, merged_column, x23))

select() as a stand-alone command

Select can also be used as an independent command (not in a pipe chain). In this case, the first argument is the original dataframe to be operated upon.

# Create a new linelist with id and age-related columns
linelist_age <- select(linelist, case_id, contains("age"))

# display the column names
names(linelist_age)
## [1] "case_id"  "age"      "age_unit"

Add columns and rows

See the tabs below to add columns and rows

Add columns

mutate()

We advise creating new columns with dplyr functions as part of a chain of such verb functions (e.g. filter, mutate, etc.)
If in need of a stand-alone command, you can use mutate() or the base R style to create a new column (see below).

The verb mutate() is used to add a new column, or to modify an existing one. Below are some example of creating new columns with mutate(). The syntax is: new_column_name = value or function. It is best practice to separate each new column with a comma and new line.

linelist <- linelist %>%                       # creating new, or modifying old dataset
  mutate(new_var_dup    = case_id,             # new column = duplicate/copy another column
         new_var_static = 7,                   # new column = all values the same
         new_var_static = new_var_static + 5,  # you can overwrite a column, and it can be a calculation using other variables
         new_var_paste  = stringr::str_glue("{hospital} on ({date_hospitalisation})") # new column = pasting together values from other columns
         ) 

Scroll to the right to see the new columns:

# display the linelist data as a table
DT::datatable(linelist, rownames = FALSE, filter="top", options = list(pageLength = 5, scrollX=T) )

TIP: The verb transmute() adds new columns just like mutate() but also drops/removes all other columns that you do not mention.

New columns using base R

To define a new column (or re-define a column) using base R, just use the assignment operator as below. Remember that when using base R you must specify the dataframe before writing the column name (e.g. dataframe$column). Here are two dummy examples:

linelist$old_var <- linelist$old_var + 7
linelist$new_var <- linelist$old_var + linelist$age

Add rows

TO DO

Remember that each column must contain values of only one class (either character, numeric, logical, etc.). So adding a row requires nuance to maintain this.

linelist <- linelist %>% 
  add_row(row_num = 666, case_id = "abc", generation = 4, `infection date` = as.Date("2020-10-10"), .before = 2)

use .before and .after. .before = 3 will put it before the 3rd row. Default is to add it to the end. columns not specified will be let empty. The new row number may look strange (“…23”) but the row numbers have changed. So if using the command twice examine/test carefully.

If your class is off you will see an error like this: Error: Can’t combine ..1$infection date and ..2$infection date . (for a date value remember to wrap the date in the functionas.Date() like as.Date("2020-10-10"))

Fix classes

CAUTION: This tab may follow from previous tabs.

See section on object classes

Often you will need to set the correct class for a column. The most common approach is to use mutate() to define the column as itself, but with a different class.

First we run some checks on the classes of important columns.

The class of the “age” column is character. To perform analysis, we need those numbers to be recognized as numeric!

class(linelist$age)
## [1] "character"

The class of the “date_onset” column is also character! To perform analysis, these dates must be recognized as dates!

class(linelist$date_onset)
## [1] "character"

However, if we try to classify this column as date, we would get an error. Use table() or sort or another method to examine all the values and identify different one. For example in our dataset we see that we see that one date_onset value was entered in a different format (15th April 2014) than all the other values!

## 
## 15th April 2014      2012-05-01      2012-06-17      2012-06-21      2012-06-24 
##               1               1               1               1               2 
##      2012-06-27 
##               1

Before we can classify “date_onset” as a date, this value must be fixed to be the same format as the others. You can fix the date in the source data, or, we can do in the cleaning pipeline via mutate() and recode(). This must be done before the commands to convert to class Date. (LINK TO DATE SECTION).

The mutate() line can be read as: “mutate date_onset to equal date_onset recoded so that OLD VALUE is changed to NEW VALUE”. Note that this pattern (OLD = NEW) for recode() is the opposite of most R patterns (new = old). The R development community is working on revising this for recoding.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-row_num) %>% 

# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  ###################################################
  # fix incorrect values                 # old value       # new value
  mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
  
  # correct the class of the columns
  mutate(age           = as.numeric(age),
         date_onset    = as.Date(date_onset, format = "%Y-%m-%d"))

Especially after converting to class date, check your data visually or with table() to confirm that they were converted correctly! For as.Date(), the format = argument is often a source of errors.

Fix class for multiple columns at once

class(linelist$date_infection)
## [1] "character"
head(linelist$date_infection)
## [1] "2014-04-09" NA           NA           "2014-05-07" NA          
## [6] "2014-05-06"

You can use The dplyr function across() with mutate() to convert several columns at once to a new class. across() allows you to specify which columns you want a function to apply to. Below, we want to mutate the columns where is.POSIXct() (a type of date/time class that shows unnecessary timestamps) is TRUE, and apply the function is.Date() to them, in order to convert them to class “date”.

  • Note that within across() we also use the function where().
  • Note that is.POSIXct is from the package lubridate. Other similar functions (is.character(), is.numeric(), and is.logical()) are from base R
  • Note that the functions in across() are written without the empty parentheses ()
linelist <- linelist %>% 
  mutate(across(where(lubridate::is.POSIXct), as.Date))

Recoding values

blah blah blah TO DO

Manual recoding

mutate() is also used to recode the values in a column. For example, in linelist the values in the column “hospital” must be cleaned. There are several incorrect spellings, and many missing values.

table(linelist$hospital, useNA = "always")
## 
##                            Connaught Hopital 
##                                           48 
##                           Connaught Hospital 
##                                         1757 
##                                   Hospital A 
##                                           54 
##                                   Hospital B 
##                                           54 
##                             Military Hopital 
##                                           31 
##                            Military Hospital 
##                                          802 
##                             Mitylira Hopital 
##                                            1 
##                            Mitylira Hospital 
##                                           80 
##                                        other 
##                                          905 
##  Princess Christian Maternity Hopital (PCMH) 
##                                           11 
## Princess Christian Maternity Hospital (PCMH) 
##                                          421 
##                               Rokupa Hopital 
##                                           11 
##                              Rokupa Hospital 
##                                          452 
##                                         <NA> 
##                                         1500

Using recode()

To change spellings manually, one-by-one, you can use the recode() function within the mutate function. The code is saying that the column “hospital” should be defined as the current column “hospital”, but with certain changes (the syntax is OLD = NEW). Don’t forget commas!

linelist <- linelist %>% 
  mutate(hospital = recode(hospital,
                      # OLD = NEW
                      "Mitylira Hopital"  = "Military Hospital",
                      "Mitylira Hospital" = "Military Hospital",
                      "Military Hopital"  = "Military Hospital",
                      "Connaught Hopital" = "Connaught Hospital",
                      "Rokupa Hopital"    = "Rokupa Hospital",
                      "other"             = "Other",
                      "Princess Christian Maternity Hopital (PCMH)" = "Princess Christian Maternity Hospital (PCMH)"
                      ))

Now we see the values in the hospital column have been corrected:

table(linelist$hospital, useNA = "always")
## 
##                           Connaught Hospital 
##                                         1805 
##                                   Hospital A 
##                                           54 
##                                   Hospital B 
##                                           54 
##                            Military Hospital 
##                                          914 
##                                        Other 
##                                          905 
## Princess Christian Maternity Hospital (PCMH) 
##                                          432 
##                              Rokupa Hospital 
##                                          463 
##                                         <NA> 
##                                         1500

TIP: The number of spaces before and after an equals sign does not matter. Make your code easier to read by aligning the = for all or most rows. Also, consider adding a hashed comment row to clarify for future readers which side is OLD and which side is NEW.

TIP: Sometimes a blank character value exists in a dataset (not recognized as R’s value for missing - NA). You can reference this value with two quotation marks with no space inbetween ("").

Manual recode using base R

If you need to write a stand-alone command using base R (e.g. not part of a chain of dplyr verbs), then you can create a new column by assigning it a value. In the command below, the column new_var does not exist until after the command is executed. In this simple example the column is assigned the static value “new value”, so for all rows the value will be “new value”.

linelist_raw$new_var <- "new value"

If necessary, you make manual changes to a specific value in a dataframe by referencing the row number of case ID. But remember it is better if you can make these changes permanently in the underlying data!

Here is a fake example. It reads as “Change the value of the dataframe linelist‘s column onset_date (for the row where linelist’s column case_id has the value ’9d4019’) to as.Date("2020-10-24")”.

linelist$date_onset[linelist$case_id == "9d4019"] <- as.Date("2020-10-24")

Recoding by logic/condition

If you need to use logic statements to recode values, or want to use operators like %in%, use dplyr’s case_when() instead. If you use case_when() please read the thorough explanation HERE LINK, as there are important differences from recode() in syntax and logic order!

linelist <- linelist %>% 
  mutate(hospital = case_when(
    hospital == "Connaught Hopital"           ~ "Connaught Hospital",
    hospital == "Rokupa Hopital"              ~ "Rokupa Hospital", 
    hospital %in% c("Mitylira Hopital", 
      "Mitylira Hospital",
      "Mitylira Hospital",
      "Military Hopital")                     ~ "Military Hospital",
    is.na(hospital)                           ~ "Missing",
    stringr::str_detect(hospital, "Princess") ~ "Princess Christian Maternity Hospital (PCMH)",
    TRUE                                      ~  hospital) 
)

Table to confirm the values have been cleaned as expected:

table(linelist$hospital, useNA = "always")
## 
##                           Connaught Hospital 
##                                         1805 
##                                   Hospital A 
##                                           54 
##                                   Hospital B 
##                                           54 
##                            Military Hospital 
##                                          914 
##                                      Missing 
##                                         1500 
##                                        Other 
##                                          905 
## Princess Christian Maternity Hospital (PCMH) 
##                                          432 
##                              Rokupa Hospital 
##                                          463 
##                                         <NA> 
##                                            0

ifelse() and if_else()

For simple cases you can use ifelse() or if_else(). Though in most cases it is better to use case_when().

ifelse() and if_else():

These commands are simplified versions of an if and else statement. The general syntax is ifelse(condition, value if TRUE, value if FALSE). if_else() is a special version from dplyr that handles dates.

Stringing together ifelse statements - NOT ADVISED!! Difficult to read and keep track of.

IMAGE of ifelse string with X across is.

Use case-when() instead.

You can reference other columns with the ifelse() function within mutate():

linelist <- linelist %>% 
  mutate(date_death = if_else(outcome == "Death", date_outcome, NA_real_))

Special recoding functions

replace_na()

To change missing values to a character value, such as “Missing”, use the function replace_na() in the same manner as recode above:

linelist <- linelist %>% 
  mutate(hospital = replace_na(hospital, "Missing"))

na_if()

Likewise you can quickly convert character values to NA using na_if(), as below:

linelist <- linelist %>% 
  mutate(hospital = na_if(hospital, "Missing"))

coalesce()

This dplyr function finds the first non-missing value at each position. So, you provide it with columns and for each row it will fill the value with the first non-missing value in the columns you provided.

For example, you might use thiscoalesce()` create a “location” variable from hypothetical variables “patient_residence” and “reporting_jurisdiction”, where you prioritize patient residence information, if it exists.

linelist <- linelist %>% 
  mutate(location = coalesce(patient_residence, reporting_jurisdiction))

TO DO lead(), lag() cumsum(), cummean(), cummin(), cummax(), cumany(), cumall(),

Using cleaning dictionaries

CAUTION: This tab may follow from previous tabs.

## load cleaning rules and only keep columns in mll
mll_cleaning_rules <- import(here("dictionaries/mll_cleaning_rules.xlsx")) %>%
  filter(column %in% c(names(mll_raw), ".global"))

## define columns that are not cleand
unchanged <- c(
  "epilink_relationship",
  "narratives",
  "epilink_relationship_detail"
)

mll_clean <- mll_raw %>%
  ## convert to tibble
  as_tibble() %>%
  ## clean columns using cleaning rules
  clean_data(
    wordlists = mll_cleaning_rules,
    protect = names(.) %in% unchanged
  )

Add to the pipe chain

We will add the case_when() steps to the pipe chain, as case_when() is generally more flexible for the future.

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-row_num) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15th April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(age           = as.numeric(age),
           date_onset    = as.Date(date_onset, format = "%Y-%m-%d")) %>% 

# ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
  ###################################################

    # clean values of hospital column
    mutate(hospital = case_when(hospital == "Connaught Hopital"                           ~ "Connaught Hospital",
                                hospital == "Rokupa Hopital"                              ~ "Rokupa Hospital", 
                                hospital %in% c("Mitylira Hopital", 
                                                "Mitylira Hospital",
                                                "Mitylira Hospital",
                                                "Military Hopital")                       ~ "Military Hospital",
                                is.na(hospital)                                           ~ "Missing",
                                hospital == "Princess Christian Maternity Hopital (PCMH)" ~ "Princess Christian Maternity Hospital (PCMH)",
                                TRUE                                                      ~  hospital))

Filter rows

CAUTION: This tab may follow from previous tabs.

After selecting columns, a typical cleaning step is to filter the dataframe for specific rows using the dplyr verb filter(). Within filter(), give the logic that must be TRUE for a row in the dataset to be kept.

The tabs below show how to filter rows based on simple and complex logical conditions, and how to filter/subset rows as a stand-alone command and with base R

Simple filter()

A simple example below re-defines the dataframe linelist as itself having filtered the rows to meet a logical condition. Only the rows where the logical statement within the parentheses is TRUE are kept. In this case, the logical statement is !is.na(case_id), which is asking whether the value in the column case_id is not missing. Thus, rows where case_id is not missing are kept.

Before the filter is applied, the number of rows in linelist is 6127.

linelist <- linelist %>% 
  filter(!is.na(case_id))  # keep only rows where case_id is not missing

After the filter is applied, the number of rows in linelist is 6124.

Complex filter()

A more complex example using filter():

Examine the data

Below is a simple one-line command to create a histogram of onset dates. See that a second smaller outbreak from 2012-2013 is also included in this dataset. For our analyses, we want to remove entries from this earlier outbreak.

hist(linelist$date_onset, breaks = 50)

#### Be aware how filters handle missing values

Can we just filter by onset_date to rows after June 2013? Caution! Applying filter(date_onset > as.Date("2013-06-01"))) would accidentally remove any rows in the later epidemic with a missing date of onset!

DANGER: Filtering to greater than (>) or less than (<) a date can remove any rows with missing date values (NA)! This is because NA is treated as infinitely large and small.

Design the filter

What other criteria can we filter on? We also happen to know that this first epidemic occurred at Hospital A, Hospital B, and that there were also 10 cases at Connaught Hospital. Hospitals A & B did not have cases in the second epidemic, but Connaught Hospital had many. This is a complex filter to apply - it is wise to cross-tabulate these columns to know exactly how many rows we expect should be removed.

Let’s examine a cross-tabulation to make sure we exclude only the correct rows:

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of the date_onset
      useNA     = "always")                              # show missing values
##                                               YearOnset
## Hospital                                       2012 2013 2014 2015 <NA>
##   Connaught Hospital                              8    1 1377  339   80
##   Hospital A                                     34   18    0    0    2
##   Hospital B                                     39   12    0    0    3
##   Military Hospital                               0    0  676  199   39
##   Missing                                         0    0 1129  306   62
##   other                                           0    0  687  176   42
##   Princess Christian Maternity Hospital (PCMH)    0    0  327   90   15
##   Rokupa Hospital                                 0    0  347   97   19
##   <NA>                                            0    0    0    0    0

We want to exclude only the nrow(linelist %>% filter(hospital %in% c("Hospital A", "Hospital B") | date_onset < as.Date("2013-06-01"))) rows from 2012 and 2013 at those three hospitals (A, B, and Connaught), including the 2 from Hospitals A & B with missing onset dates, but not any others with missing onset dates. We start with a linelist of nrow(linelist). Here is our statement:

linelist <- linelist %>% 
  filter(date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B")))

nrow(linelist)
## [1] 6007

When we re-make the cross-tabulation, we see that Hospitals A & B are removed completely, the 10 Connaught Hospital cases from 2012 & 2013 are removed, and all other values are the same - just as we wanted.

table(Hospital  = linelist$hospital,                     # hospital name
      YearOnset = lubridate::year(linelist$date_onset),  # year of the date_onset
      useNA     = "always")                              # show missing values
##                                               YearOnset
## Hospital                                       2014 2015 <NA>
##   Connaught Hospital                           1377  339   80
##   Military Hospital                             676  199   39
##   Missing                                      1129  306   62
##   other                                         687  176   42
##   Princess Christian Maternity Hospital (PCMH)  327   90   15
##   Rokupa Hospital                               347   97   19
##   <NA>                                            0    0    0

Multiple filter statements can be separated by commas, or you can always pipe to a separate filter() statement for clarity. Adding these filter statements to the cleaning pipe chain now looks like this:

# CLEANING 'PIPE' CHAIN (starts with raw data and pipes it through cleaning steps)
##################################################################################
linelist <- linelist_raw %>%
    
    # standardize column name syntax
    janitor::clean_names() %>% 
    
    # manually re-name columns
           # NEW name             # OLD name
    rename(date_infection       = infection_date,
           date_hospitalisation = hosp_date,
           date_outcome         = date_of_outcome) %>% 
  
    # remove column
    select(-row_num) %>% 

    # fix incorrect values                 # old value       # new value
    mutate(date_onset = recode(date_onset, "15 April 2014" = "2014-04-15")) %>% 
    
    # correct the class of the columns
    mutate(age           = as.numeric(age),
           date_onset    = as.Date(date_onset, format = "%Y-%m-%d")) %>% 
    
  # ABOVE ARE UPSTREAM CLEANING STEPS ALREADY DISCUSSED
    ###################################################
    filter(!is.na(case_id),  # keep only rows where case_id is not missing
           date_onset > as.Date("2013-06-01") | (is.na(date_onset) & !hospital %in% c("Hospital A", "Hospital B"))
           ) # close filter

Now the number of rows in linelist is 6007.

Filter as a stand-alone command

Filtering can also be done as a stand-alone command (not part of a pipe chain). Like other dplyr verbs, in this case the first argument must be the dataset itself.

# dataframe <- filter(dataframe, condition(s) for rows to keep)

linelist <- filter(linelist, !is.na(case_id))

You can also use base R to subset using square brackets which reflect the [rows, columns] that you want to retain.

# dataframe <- dataframe[row conditions, column conditions] (blank means keep all)

linelist <- linelist[!is.na(case_id), ]

TIP: Use bracket-subset syntax with View() to quickly review a few records.

This base R syntax can be handy when you want to quickly view a subset of rows and columns. Use the base R View() command (note the capital “V”) around the [] subset you want to see. The result will appear as a dataframe in your RStudio viewer panel. For example, if I want to review onset and hospitalization dates of 3 specific cases:

View(linelist[linelist$case_id %in% c("11f8ea", "76b97a", "47a5f5"), c("date_onset", "date_hospitalisation")])

Groups by condition (case_when())

CAUTION: This tab may follow from previous tabs.

TODO tutorial on using case_when()

Numeric groups

For example, creating age groups cut()

case_when()

age_categories() (R4Epis package)

by percentile

WHAT TO DO IF AGE IS SPREAD ACROSS TWO VARAIBLES (e.g. numeric age + unit)

Highest in hierarchy

CAUTION: This tab may follow from previous tabs.

Within a group, indicate/convert to the highest value in the group

Santa Clara County example - COVID contact tracing data - classification of multiple phone call records from same person into the highest category. (classify all as the highest of the group)

Grouping values

CAUTION: This tab may follow from previous tabs.

Using mutate on GROUPED dataframes https://dplyr.tidyverse.org/reference/mutate.html

Taken from website above:

Because mutating expressions are computed within groups, they may yield different results on grouped tibbles. This will be the case as soon as an aggregating, lagging, or ranking function is involved. Compare this ungrouped mutate:

starwars %>%
  select(name, mass, species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
With the grouped equivalent:

starwars %>%
  select(name, mass, species) %>%
  group_by(species) %>%
  mutate(mass_norm = mass / mean(mass, na.rm = TRUE))
The former normalises mass by the global average whereas the latter normalises by the averages within species levels.

Transforming multiple variables at once

CAUTION: This tab may follow from previous tabs.

across dplyr

A transformation can be applied to multiple variables at once using the across() function from the package dplyr (contained within tidyverse package).

across() can be used with any dplyr verb, but commonly with as mutate(), filter(), or summarise(). Here are some examples to get started.

across() with mutate():

Change all columns to character class

#to change all columns to character class
linelist <- linelist %>% 
  mutate(across(everything(), as.character))

Change only numeric columns ```

Here are a few online resources on using across(): Hadley Wickham’s thoughts/rationale